Kodsnack 654 - German-style strings, with Matt Topol
2025-08-05 05:26
Fredrik talks to Matt Topol about Arrow and how the Arrow ecosystem is evolving. Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics - which means passing data between things without needing to transform it, and ideally even without needing to copy it.
What makes the ecosystem grow, and why is it very cool to have Arrow on the GPU? What is the connection between Arrow, machine learning, and Hugging face? Matt emphasizes the value of open standards: even when they work with or within more closed systems, they can help open things up and bring about more modular solutions, so that developers can focus on doing their core area really well.
This episode can be seen as a follow-up to episode 567, where Matt first joined to discuss everything Arrow.
Recorded during Øredev 2024.
Thank you Cloudnet for sponsoring our VPS!
Comments, questions or tips? We are @kodsnack, @tobiashieta, @oferlund and @bjoreman on Twitter, have a page on Facebook and can be emailed at info@kodsnack.se if you want to write longer. We read everything we receive.
If you enjoy Kodsnack we would love a review in iTunes! You can also support the podcast by buying us a coffee (or two!) through Ko-fi.
Links
- Matt
- Matt’s Øredev 2023 talks: State of the Apache Arrow ecosystem: How your project can leverage Arrow! and Leveraging Apache Arrow for ML workflows
- Previous episodes with Matt
- Øredev 2024
- Matt’s Øredev 2024 talks - on Arrow ADBC and Composable and modular data systems
- ADBC - Arrow database connectivity
- Arrow
- Snowflake
- Snowflake drivers for ADBC
- Bigquery
- The Bigquery driver
- Microsoft Fabric
- Duckdb
- Postgres
- SQLite
- Arrow flight - RPC framework for services based on Arrow data
- Arrow flight SQL
- Microsoft Power BI
- Velox
- Apache datafusion
- Query planning
- Substrait - query IR
- Polaris
- Libcudf
- Nvidia RAPIDS
- Pytorch
- Tensorflow
- Arrow device interface
- DLPack - in-memory tensor structure
- Tensors
- Nanoarrow
- Voltron data - where Matt used to work. He’s now at Columnar
- Theseus GPU compute engine
- The composable data management system manifesto
- Support us on Ko-fi!
- Matt’s book - In-memory analytics with Apache Arrow
- Spark
- Spark connect
- RPC
- UDFs
- Photon
- Datafusion
- Apache Cassandra
- ODBC
- JDBC
- R - programming language for statistical computing
- Hugging face
- Ray
- Stringview - “German-style strings”
- Scaling up with R and Arrow - the book on using Arrow with R
Titles
- It’s gotten a lot bigger
- The bones of it are in the repo
- (Powered by ADBC)
- Individual compute components
- Feed it substrate
- Where the ecosystem is going
- Arrow on the GPU
- The data stays on the GPU
- A forced copy
- Leverage that device interface
- Without forcing the copy
- Shy of that last mile
- Turtles all the way down
- The guy who said yes
- German-style strings